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Generalised  additive  models^Hastie  And  TibshiFani,  1086  Statistical  Science)  extend  the 
class  of  generalized  linear  models  by  allowing  an  arbitrary  smooth  function  for  any  or  all  of  the 

covariates.  The  functions  are  estimated  by  the  local  scoring  procedure,  using  a  smoother  as  a 

z' — ^ 

building  block  in  an  iterative  algorithm.  Ha  this  papetywe  utilizes  a  cubic  spline  smoother  in  the 

«\ 

algorithm  and  show  how  the  resultant  procedure  can  be  viewed  as  a  method  for  automatically 
smoothing  a  suitably  defined  ^partial  residual  ,  and  more  formally,  a  method  for  maximizing 

ru  °  .9^ 

a  penalized  likelihood.  We  also  examine  convergence  of  the  inner  (“backlit ting*)  loop  in  this 
case  and  illustrate  these  ideas  with  some  binary  response  data. 


Key  words:  Generalized  additive  model,  spline  smoothing'  non-par ametric  regression^ 
partial  residual,  penalized  likelihood. 


1.  Introduction. 

This  paper  describes  a  technique  for  non-parametricaliy  estimating  covariate  effects  in  any 
generalized  linear  model.  In  order  to  motivate  the  technique  we’ll  consider  an  example.  Risch, 
Weiss,  Lyon,  Daling  and  Liff  (1983)  analyzed  a  case-control  study  of  ovarian  cancer.  Their 
data  consists  of  987  observations  on  the  incidence  (y)  of  ovarian  cancer  (l=case  0=control), 
and  17  covariates.  For  illustration  here  we  will  focus  on  two  covariates:  ovulatory  age  (OA), 


an  — of  the  nnmbar  of  jraan  that  a  woman  haa  ovulated  in  her  Ufa  and  age  at  end  of 
ovulatory  life  (AEO).  We  shall  fit  a  number  of  logistic  models  to  these  data,  with  log[P(Y  — 
1  |OA,  AEO)/(l-P(X  =  1 1 0 A,  AEO))]  =  a+/i(OA)+/j(AEO)  for  some  functions  /i(OA) 
and  ft(AEO)  .  A  linear  logistic  St  (that  is  /i(OA)  =  fii OA,fa{AEO)  =  jUiAEO)  gave  &  =  **, 
=  .00  and  £,  =  . -07  both  effects  being  highly  significant.  Denote  the  fitted  probabilities 
by  p.  In  order  to  investigate  the  adequacy  of  the  linear  forms  for  /i(.)  and  /j(.),  Landwehr, 
Pregibon  and  Shoemaker  (1984)  suggested  smoothing  the  partial  residual 

Zl  =  &  +  $l0A+  '/(T-J)  (1) 

against  OA,  and  similarly  for  /*(.).  Thu  is  justified  by  the  fact  that  if  in  reality  log[P{y  = 
1  |OA,AEO)/(1  -  P(Y  as  1 1 0 A,  AEO))]  =  a  +  fa AEO  +  /i(OA),  and  /i(OA)  is  orthogonal 
to  AEO ,  then  E(Z  \ OA)  «  /i(OA)  and  likewise  for  ft(AEO).  A  smooth  of  Z\  versus  OA  is 
shown  in  Figure  1  and  shows  some  non-linearity.  This  smooth  and  all  smooths  is  this  paper 
were  computed  using  a  cubic  spline  smoother,  described  in  the  next  section. 

A  shortcoming  of  this  technique  is  that  E[Z\  |OA)  w  /i(OA)  only  if  a  linear  function  for 
AEO  is  appropriate,  and  similarly  for  /j (AEO).  What  is  needed  is  simultaneous  estimation 
of  fi(.)  and  /j(.).  This  can  be  achieved  as  follows.  We  start  with  initial  guesses  A(-)  and  /j(.) 
(for  example  fiiOA  and  friAEO)  and  corresponding  fitted  value  p,  then  construct  the  quantity 

Z  =  &+  A  (OA)  +  A  (AEO)  +  (2) 

Fixing  Z,  we  alternately  smooth  Z  -  A  (AEO)  on  OA,  Z  -  fi(OA)  on  AEO,  obtaining  new 
A(-)  and  A(*)’s>  until  the  smooths  don’t  change  much.  As  a  further  refinement  we  can  update 
Z,  replacing  the  old  A(-)’s  with  the  new  ones  (and  updating  p),  then  repeat  the  first  step,  and 
so  on  until  convergence.  Note  that  if  this  procedure  converges,  the  smooth  of  Z  —  fj(AEO) 
on  OA  is  fi(OA)  and  similarly  for  A  {AEO),  which  is  analagous  to  the  use  of  Z\  in  (1).  The 
resulting  function  for  OA  is  shown  in  Figure  2.  Notice  that  the  smooth  looks  quite  different 
from  Figure  1-  this  is  due  to  the  fact  that  the  smooth  for  AEO  (not  shown)  is  also  non-linear. 
These  data  are  analyzed  more  fully  in  Section  4. 

This  idea  is  the  heart  of  the  local  scoring  algorithm,  described  in  this  paper.  The  local 


s 


■coring  technique  is  used  to  fit  what  we  call  generalised  additive  models.  In  the  exponential 
family,  these  take  the  form  g{jt)  =  a  +  /<(*<),  where  the  response  Y  has  EY  —  ft,  the  *<’■ 

are  covariatee,  and  the  /<(.)’■  are  unspecified  smooth  functions.  These  models  are  an  extension 
of  generalised  linear  models  (Nelder  and  Wedderburn  1072)  which  assume  g(jt)  —  a+ Zifa. 
(In  generalised  linear  model  terminology,  g(p)  is  called  the  *  link  function*  ).  The  local  scoring 
algorithm  estimates  the  /<(.)’■  by  repeated  smoothing  of  a  suitably  defined  partial  residual. 
The  resulting  estimates  can  be  used  to  suggest  transformations  of  the  z’s  or  as  a  predictive 
model.  One  can  allow  a  smooth  estimate  for  all  of  the  covariates  or  force  a  linear  fit  for  some 
of  them.  Such  a  semi-parametric  model  would  naturally  arise  if  some  of  the  covariates  were 
categorical  in  nature  but  would  also  be  useful  if,  for  reasons  specific  to  the  data  at  hand,  a 
linear  fit  was  desirable  for  certain  specified  covariates. 

This  paper  presents  an  expository  view  of  generalised  additive  models  and  the  local  scoring 
algorithm,  with  an  emphasis  on  the  use  of  splines  for  smoothing.  In  section  2  we  describe  the 
local  scoring  algorithm  for  generalised  additive  models  in  the  exponential  family,  showing  its 
close  relation  to  the  Fisher  scoring  technique  for  generalised  linear  models.  Section  3  briefly 
describes  the  cubic  spline  smoother.  The  ovarian  cancer  data  is  analysed  more  folly  in  Section 
4,  illustrating  the  use  of  degrees  of  freedom  and  confidence  bands  for  model  assessment.  In 
Section  5  we  give  a  justification  for  the  local  scoring  procedure  based  on  the  notion  of  penalised 
likelihood.  Finally,  Section  6  discusses  extension  of  the  methods  and  their  relation  to  other 
work  in  the  literature. 


2.  The  Local  Scoring  Method  in  the  Exponential  family. 

In  order  to  define  the  local  scoring  algorithm,  it  is  convenient  to  first  describe  the  process 
of  repeated  smoothing,  called  backfitting  (Friedman  and  Stuetsle,  1981).  Suppose  we  have  data 
((xi,  iji, . . .  Sip), . . .  (xn,  *ni  • . .  *n*))  and  we  wish  to  fit  a  model  of  the  form  E(Z  \Xi,. . .  Xp)  — 
a  +  fj(Xj).  Let  S(Z  |  z)  represent  some  estimate  of  E(Z  |x),  for  example  a  cubic  spline 
smoother.  We  can  estimate  the  /;(.)’s  by  repeated  smoothing: 
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Backfitting  Algorithm 

mhHntiaa;  {//(.)  =*  0,  j  =  1, . .  .p)  ,  4  ■  everafe(s<). 

Gycta  j  *  1,2*  •  •  •  »Pt  l»2,  •  •  «p»  1»2»  •  •  *p 

r,  -  -  4  -  E** /*(**).•'  -  1. •  •  • » 

M*fl)  *  s(r » !**)»•  ■!••••• 

Until  JtSS  =  E*«i(*  “A"  E /*(**))*  eoDWin. 

Some  theoretical  reeulta  are  available  for  the  backfitting  algorithm.  Breiman  and  Fried¬ 
man  (IMS)  show  that  the  theoretical  version  of  backfitting  (Le.  with  conditional  expectation? 
instead  of  smoothers)  converges  to  the  beet  (in  L*  norm)  additive  approximation  to  Z  in  terms 
of  Xi,X%,...Xp.  Furthermore,  if  S(Z  \x)  represents  a  least  squares  fit  of  Z  on  x,  then  the 
backfitting  algorithm  can  be  shown  to  converge  to  the  least  squares  fit  of  Z  on  xj.xj,  ...x,.  In 
the  Appendix,  we  prove  convergence  of  backfitting  with  two  cubic  spline  smoothers  and  discuss 
the  general  (p  >  2)  case. 

Now  consider  the  class  of  generalised  additive  models  in  the  exponential  family.  We 
assume  that  for  a  fixed  value  of  the  scale  parameter  Y  has  adenaity  h(y,  fi)  in  the  exponential 
family,  with  EY  =  Var(Y)  —  V4  end  q  =  g(p)  =  a  +  Ei  The  deviance  of  a  model 

It  is  defined  by  dev(y,p)  =  2f^l  log  h(y.,  m) log  &(»,;*<)].  The  local  scoring  algorithm  uses 
iterated  backfitting  of  a  partial  residual  to  estimate  /t(-), .../,(.). 

The  Local  Scoring  Algorithm 
Initialisation  ff(.)  =  0 ,j  =  1, . .  ,p  ,  &  =  g(average(y)). 

Loop  over  outer  iteration  counter  m 

fT  *  &  +  E?  ST (*/.), =  9~x{nT) 

*  =  nT  +  (vi-  =  (ditT/dn?)*v-1 

Obtain  =  1, . . .p  by  applying  the  backfitting  algorithm  to  with  weights  u><. 

Until  dev(y,ft )  fails  to  decrease 
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Notie*  that  that*  ar*  really  two  noted  loop*  in  the  algorithm.  In  the  inner  (backfitting) 
loop  *  i*  held  fixed  and  the  /y  are  re-eetimated,  while  in  the  outer  loop,  q,  ft,  x  and  w  are 
updated.  The  weight*  ar*  appropriate  becauee  x  haa  variance  proportional  to  (dri/dn)2V  . 
The  quantity  x  ie  a  general  form  of  Landwehr,  Pregibon  and  Shoemaker'*  partial  residual  and 
ie  called  the  adjusted  dependent  variable  or  working  variate  in  the  generalised  linear  model 
literature,  hi  fact,  if  the  smooths  in  the  backfitting  algorithm  are  global  least  squares  fits, 
then  the  local  scoring  algorithm  reduces  to  adjusted  dependent  variable  regression,  the  GLIM 
package  implementation  of  maximum  likelihood  estimation  by  Fisher  scoring  (see  Nelder  and 
Wedderburn  1072).  This  would  of  course  be  an  inefficient  implementation  because  backfitting 
is  an  inefficient  way  to  find  the  least  squares  fit  of  Z  on  Xi,  Xt, . . .  Xf.  Note  also  that  if  h  is 
the  normal  density  then  *1  =  gu  then  the  algorithm  consists  only  of  the  inner  loop,  a  backfit  of 
x  on  zi,... gp. 

As  in  generalised  linear  modelling,  one  can  define  degrees  of  freedom  for  generalized  ad¬ 
ditive  models  and  obtain  an  estimate  of  it.  Briefly,  the  degrees  of  freedom  is  defined  as  the 
expected  decrease  in  the  deviance  and  is  computed  from  the  trace  of  an  appropriate  matrix 
(related  to  the  smoother).  The  theoretical  support  for  this  calculation,  however,  is  not  sub¬ 
stantial  and  it  is  meant  to  be  used  only  as  a  rule  of  thumb.  In  a  similar  fashion,  pointwise 
confidence  bands  can  be  estimated  for  the  functions.  More  details  on  degrees  of  freedom  and 
confidence  bands  may  be  found  in  Hastie  and  Tibshirani  (1986),  and  Hastie  and  Tibshirani 
(1985a). 

3.  The  Cubic  spline  smoother.  * 

The  local  scoring  algorithm  requires  an  estimate  of  a  conditional  expectation  in  the  back- 
fitting  loop.  In  this  paper,  we  will  discuss  the  use  of  cubic  spline  smoothers,  which  we  will 
review  briefly  here.  Note  however  that  any  other  reasonable  estimate  of  conditional  expecta¬ 
tion  could  be  used.  In  Hastie  and  Tibshirani  (1986)  we  used  running  lines  smoothers;  other 
candidate  smoothers  include  a  kernel  smoother  (see  e.g.  Cleveland  1979)  ,  or  a  smoother  such 
as  McDonald  and  Owen’s  (1984)  *  split  linear  smoother”  designed  to  reproduce  discontinu¬ 
ities.  One  could  also  use  different  smoothers  for  each  covariate —  for  example  a  *  wrap  around” 
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smoother  would  be  appropriate  for  a  periodic  variable  like  month  of  the  year. 

Given  data  (*i,*i),.  •  .  (xn,*n),  consider  the  following  minimisation  problem.  Find  h(x) 

to  mwimiM 

X>  -  + *  f*“  [»'(•)]’*  (») 

I  * -oo 

where  A  is  a  fixed  tuning  constant.  As  shown  in  Reinsch  (1967)  (Silverman  1985)  the  solution 
h(x)  is  a  cubic  spline  with  knots  at  some  of  the  *<’ s,  that  is,  a  piecewise  cubic  function  with 
pieces  joined  at  the  s.  The  parameter  A  trades  off  variance  and  bias  of  the  solution.  When 
A  =  0,  the  solution  is  any  interpolating  function,  while  if  A  =  +oo,  the  solution  is  the  least 
squares  line.  If  we  consider  the  value  of  h(x)  only  at  zi,  xj, . . .  Xn,  an  equivalent  form  of 
this  problem  is  the  following.  Let  h  =  (h(x i), . . . &(*»)),  *  =  (*i, . . .  *»)  and  K  be  an  n  x  n 
"penalty”  matrix  constructed  as  follows.  Let  ft  =  x,-+i  -  *  =  1, . . .  n  - 1,  A  be  a  tri-diagonal 

(n  — 2)  x  n  matrix  with  A«  =  1/hi,  =  -(l/ht  +  l/A<+i),  Ai|<+a  =  1/hi+i,  and  let  W  be  a 

symmetric  tri-diagonal  matrix  of  order  n  -  2  with  it<  =  Wt>i_i  =  A</6,  Wu  =  (ft  +  g,+i)/3. 
Finally,  let  K  —  A‘W-1A.  Then  the  minimizer  of  (3)  also  minimizes 

(s  -  h)*(z  -  h)  +  Xh'Kh  (4) 

Furthermore,  one  can  express  the  solution  K  as  Sz  where  5  =  (I  +  XK)~l,  I  being  the  n  x  n 
identity  matrix.  This  representation  is  useful  analytically  but  not  very  stable  computationally. 
For  the  latter,  an  algorithm  based  on  Cholesky  decomposition  is  preferred  (see  e.g.  Yandell 
1986). 

The  parameter  A  can  either  be  chosen  on  subjective  grounds,  or  by  cross-validation,  gen¬ 
eralized  cross-validation  (Craven  and  Wahba  1979)  or  asymptotic  generalized  cross-validation 
(Silverman  1985).  In  "automatic”  mode,  the  local  scoring  algorithm  uses  generalized  cross- 
validation  to  pick  A  each  time  a  smooth  is  computed. 


4.  Analysis  of  the  ovarian  cancer  data. 


We  now  analyze  the  ovarian  cancer  data  discussed  in  Section  1.  Risch  and  co- workers 
analyzed  a  case  control  study  of  987  women  in  Washington  and  Utah.  They  interviewed  284 
women  with  ovarian  cancer  and  703  controls,  and  recorded  the  following  covariates:  number 
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of  children,  number  of  miscarriages,  number  of  months  of  lactation,  obesity  (FAT),  oral  con¬ 
traceptive  usage  and  age  at  end  of  ovulatory  period  (ABO).  The  women  were  also  frequency 
matched  by  age  category  and  state  of  residence.  An  estimate  of  ovulatory  age  (OA)  was  con¬ 
structed  using  all  the  variables  except  AEO  and  the  main  hypothesis  was  that  OA  would  be 
related  to  the  incidence  of  ovarian  cancer. 

Table  1  shows  an  analysis  of  deviance  for  a  number  of  logistic  models  fit  to  these  data 
(see  Breslow  and  Day  1980  for  more  details  on  using  logistic  regression  models  in  case  control 
studies).  All  models  in  the  table  include  dummy  variable  to  account  for  the  matching.  The 
first  3  lines  of  the  table  indicate  that  OA  is  an  extremely  important  factor,  after  adjusting 
for  the  matching  variables  and  obesity.  Figure  3  shows  the  smooth  estimate  for  OA.  The  risk 
increases  with  ovulatory  age  but  levels  off  at  about  35  years.  Comparison  of  lines  2  and  3  of 
the  table  confirm  that  there  is  a  non-linear  effect  of  OA.  The  upper  and  lower  curves  in  Figure 
3  represent  9596  confidence  bands  mentioned  in  Section  2. 

The  middle  section  of  Table  1  shows  that  even  when  an  adjustment  is  made  for  ABO,  OA 
is  still  extremely  important.  However,  the  adjustment  for  AEO  removes  the  plateau  behaviour 
of  OA:  the  smooth  (not  shown)  looks  much  like  that  in  Figure  2.  The  smooth  for  ABO  (Figure 
4)  is  also  non-linear.  The  downturn  at  about  age  45  may  be  due  to  the  fact  that  some  women 
who  stop  ovulating  at  an  early  age  do  so  because  they  have  ovarian  cancer.  The  last  section  of 
the  table  examines  the  effect  of  entering  the  remaining  variables  into  the  model.  The  deviance 
decrease  (compared  to  the  second  line)  is  significant  indicating  that  OA  may  not  fully  capture 
the  effects  of  the  other  variables. 

Note  that  simple  interactioils  can  be  modelled  bj  taking  products  of  variables  and  treating 
the  product  as  a  new  covariate.  Alternatively,  we  can  fit  models  to  subgroups  of  the  data.  This 
might  be  useful  here,  for  example,  in  examining  a  possible  interaction  between  OA  and  FAT. 
A  more  extensive  analysis  of  these  data,  with  an  emphasis  on  the  medical  aspects,  will  appear 
elsewhere. 


>•-  «'  ,  r,  v  .  if,  f,  fm  A 
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5.  Justification  of  local  scoring  through  penalized  likelihood. 

In  Hastie  and  Tibshirani  (1986)  we  viewed  the  local  scoring  procedure  aa  an  empirical 
method  for  minimising  the  expected  log-likelihood  of  the  data.  When  a  linear  smoother  such  as 
a  cubic  spline  smoother  is  used  in  the  algorithm  an  alternative  motivation  based  on  penalized 
likelihood  can  be  derived. 


Let  /(#)  be  the  log-likelihood,  where  9  =  a  +  /y(xy,),  and  let  a  =  average(y).  Let  Ki 

be  the  n  x  n  symmetric  penalty  matrix  defined  in  Section  3,  and  /,-  =  (/y(iyi), . . .  /y (*yn))»i  = 
1,2, ...p  and  consider  the  following  problem.  Find  fu--  -Ip  to  maximize 


'(*)  -  *  £>/'*./. 


Letting  A  =  E(-tPl/d09t),  a  diagonal  matrix  with  diagonal  elements  a,-,  we  show  that  a 
Fisher  scoring  step  is  achieved  by  applying  the  backfitting  algorithm  to  appropriate  adjusted 
dependent  variables.  Rather  straightforward  calculations  show  that  the  Fisher  scoring  step  to 
go  from  tfd,  to  ....  /p"**  » 


A  +  AiRx 


A  +  A  jiifj 


/rw  -  ft*' 
irm-itd 


«-A  xKxft*' 
«  -  A 2K2f?d 


...  A  +  XpKp  J  \f™-f?dJ  U-XpKpf? 


where  u  =  Carrying  the  ffd  terms  to  the  right  hand  side  we  get  p  equations  (A  + 
A jKj)f™w  +  A  fp*w  =  As,  j  =  l,p  where  z  =  9otd  +  A_1u.  These  can  then  be  written 


sxiz-'Zwfr) 

sa(z-E^fr) 


V/rv  \Sp(*-z„p/r)J 

where  Sj  =  (A  +  XjKj)~lA.  Aa  noted  by  Green  and  Yandell  (1985),  Sj  computes  a  weighted 
cubic  spline  smooth,  with  weights  a*.  The  backfitting  algorithm  is  a  iterative  method  for 
solving  this  system  of  linear  equations.  In  fact,  the  backfitting  step  (7)  is  exactly  a  Gauss- 


Seidel  method  solving  a linear  system.  The  linear  system  that  it  solves  can  be  written  as 


which  is  just  (7)  rearranged. 

The  preceding  analysis  holds  for  any  penalty  matrices  Ki\  each  matrix  determines  a 
corresponding  smoother  by  Si  =  (A  +  A)-1A.  Conversely,  given  a  smoother  matrix  Si,  the 

corresponding  penalty  matrix  is  given  by  K,  =  (1/A,)A(5<_1  -  I).  Following  Reinsch  (1967), 
one  can  also  show  the  the  load  scoring  procedure  maximizes  1(9)  -  (1/2)  A,-  /*“(/"  (s)]Jds, 

analogous  to  the  the  cubic  spline  problem  (3). 

6.  Discussion. 

Local  scoring  *br  generalized  additive  models  provides  a  flexible  method  for  identifying 
non-linear  covariate  effects  in  a  variety  of  modelling  situations;  notably  the  very  situations 
in  which  it  has  become  popular  to  use  the  generalized  linear  or  GLIM  models.  The  additive 
models  can  be  used  in  a  data  analytic  fashion  to  understand  the  effect  of  covariates,  and  to 
test  hypothesis  about  effects.  A  more  conservative  approach  is  to  allow  the  non-parametric 
functions  to  suggest  parametric  transformations,  and  then  proceed  with  the  usual  linear  anal¬ 
ysis  on  the  transformed  variables.  The  local  scoring  idea  is  a  very  general  one,  and  can  be 
applied  in  any  situation  in  which  the  criterion  being  optimized  depends  on  one  or  more  smooth 
functions  (see  Hastie  and  Tibshirani  1986  for  details). 

When  cubic  spline  smoothers  are  utilized,  local  scoring  is  closely  related  to  recent  work 
by  O’Sullivan,  Yandell  and  Raynor  (l986),’Green  (1985)  and  Green  and  Yandell  (1985).  In  the 
first  paper,  a  multidimensional  “thin  plate  spline”  is  used  in  a  generalized  linear  model  setting. 
The  last  two  papers  describe  more  general  procedures,  with  an  emphasis  on  semi-parametric 
models,  i.e.  models  involving  a  linear  component  and  a  single  smooth  component.  In  fact, 
Green  derives  backfitting  equations  analogous  to  (7)  in  a  special  case  of  a  model  with  one 
smooth.  In  this  special  case,  the  only  difference  between  Green’s  method  and  local  scoring 
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with  cubic  spline*  is  the  method  for  choosing  A.  Green  uses  e  quadratic  approximation  to  the 
generalised  cross-validation  score,  o t  the  form  deviance /v where  v  is  the  estimated  degrees 
of  freedom  of  the  model.  Since  the  linearisation  step  in  the  local  scoring  is  tantamount  to  a 
quadratic  approximation  to  the  deviance,  the  two  methods  are  not  likely  to  differ  by  much 
However,  the  degrees  of  freedom  v  is  quite  difficult  to  compute  when  more  than  one  smooth 
is  present  in  the  model.  More  generally,  the  generalised  additive  models  framework  (with 
local  scoring)  differs  from  these  approaches  in  that  a)  it  emphasises  additive  mode'*  b)  it 
can  incorporate  multiple  smooths  through  the  use  of  backfitting,  and  c)  it  can  incorporate 
non-linear  smoothers. 

A  certain  amount  of  theory  already  exists  for  these  models,  notably  uniqueness  of  the  best 
additive  approximation  at  the  model  and  rates  of  convergence  for  parametric  sub-models(Stone 
1985).  More  theoretical  work  is  needed  to  refine  the  degrees  of  freedom  and  confidence  band 
computations  as  well  as  to  understand  the  effects  of  collinearity. 


SOFTWARE 

All  the  computations  in  this  paper  were  performed  by  the  Fortran  program  GAIM  (Gen¬ 
eralized  Additive  Interactive  Modelling),  a  package  available  from  the  authors  upon  request. 
An  IBM  PC  version  of  GAIM  is  also  available. 
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Appendix 

Convergence  of  backfitting  with  two  cubic  spline  smoothers 

The  cubic  spline  smoother  is  constant  preserving,  i.e.  SI  =  1.  For  definitiveness,  then, 
we  work  with  the  centered  version  of  each  smoother 

s;  =  (r~^)sj  (9) 


In  addition  we  assume  that  the  components  of  each  z%  are  in  the  same  order  as  the  components 
of  a.  Hence  S/  will  really  denote  PflSfPi  where  Pi  is  the  permutation  matrix  that  sorts  a 
in  the  order  of  *4.  (This  is  fine  since  PflS- Pi  has  the  same  principal  value  decompoeition  as 
5/).  Noting  that  5/21  =  0,  the  backfitting  process  consists  simply  of  the  alternating  steps 

fl  =  Sl(x  -  h) 

h  =  s;{z-h) 

Starting  with  initial  values  and  /j ,  if  and  denote  the  estimates  at  the  mth  stage  of 
the  backfitting  algorithm,  then  it  is  straightforward  to  show  that 


(10) 


i=0 

/r  =»  -  £(s}stf(/  -  s;)x  -  (s;s;r% 

J=0 


(11) 


(12) 


Let  ||C||  =  supa[o*ClCa]/a*ol  the  natural  norm  of  the  matrix  C.  Then  //*  and  fp  will 
converge  if  ||5/5/||  <  1  and  ||5/S/||  <  1.  If  this  is  the  case,  we  have 

i?  =(/ -(/- s;s;rlv  - 

If  5/  and  5|  have  principal  values  values  <  1,  the  conditions  ||S|52||  <  1  and  HSjS/H  <  1 
say  that  the  spaces  of  of  vectors  whose  length  is  preserved  under  each  mapping  are  disjoint. 
We  now  show  that  a  cubic  spline  smoother  matrix  has  real  positive  eigenvalues  less  than  or 
equal  to  one  and  furthermore,  j|Sx|j  <  ||x||  unless  a  is  a  linear  function  of  z.  We  can  verify 

this  through  the  representation  S  =  [I  +  A*W“1A)_1  where  W  and  A  are  defined  in  Section 

• 

3.  First  note  that  W  is  positive  definite  since  it  is  diagonal  dominant  (i.e.  the  sum  of  each  row 
is  <  the  diagonal  element  in  that  row).  Thus  W-1  exists,  A<W-1A  is  non-negative  definite 
and  hence  (/  +  A‘W-1A)-1  has  eigenvalues  <  1.  Now  suppose  (/  +  A,W-1A)-1a  =  z.  Then 
A‘W_lAa  =  0  ,  *‘A‘W-lAz  =  0  and  thus  A*  =  0  (since  W  and  hence  W~x  are  positive 
definite).  Now  A  takes  second  differences  and  hence  a  hence  must  be  a  linear  function  of  *. 

Since  we  have  removed  the  eigenspace  corresponding  to  the  constant  eigenvector,  we  see 


that  backfitting  will  only  fail  to  converge  if  *\  =  c i*j  +  cj  for  some  c\  and  cj. 

In  a  backfitting  algorithm  involving  p  cubic  spline  smoothers,  we  have  have  been  unable 


12 


to  prove  convergence.  However,  we  ere  able  to  prove  convergence  for  •  modified  (and  more 
efficient)  version  of  backfitting.  Details  are  in  Buja,  Hastie  and  Tibshirani  (1986). 

Note  that  one  can,  in  theory,  avoid  iteration  in  the  backfitting  loop  through  use  of  formula 
(12).  Unfortunately,  these  expressions  are  formidable  to  compute,  requiring  the  inversion  of 
n  x  n  matrix.  However,  Green  and  Yandell  (1985)  show  than  in  the  special  case  in  which  S\ 
represents  a  cubic  spline  smooth  and  Sj  a  least  squares  projection,  one  can  compute  these 
expressions  explicitly  in  only  O(n)  operations. 


i 
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Table  1 

Analysis  of  deviance  for  ovarian  cancer  data 


Model 

Residual 

deviance 

Degrees  of 
Freedom 

FAT 

H448 

977.0 

FAT,  0A<*-+ 1.25) 

1129.0 

9747 

FAT,  OAdlnear) 

11348 

976.0 

FAT,  AEO<*«4.65) 

1133.7 

973.2 

FAT,  AE0U=4. 65),  0A(A-*1.25)  11 15.3 

970.9 

all  covariates  Including  OA 

11043 

967.3 

f(oa) 


30 


40 


50 


60 


oa 


smooth  for  OA  with  AEO  smooth 


f(oa) 
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